Abstract - This work introduces a novel nonparametric density index defined on graphs, the Sum-over-Forests (SoF) density index. It
is based on a clear and intuitive idea: high-density regions in a graph are characterized by the fact that they contain a large number
of low-cost trees with high outdegrees, while low-density regions contain few. Therefore, inspired by [1], a Boltzmann probability
distribution on the countable set of forests in the graph is defined so that large (high-cost) forests occur with a low probability while
short (low-cost) forests occur with a high probability. Then, the SoF density index of a node is defined as the expected outdegree of this
node in a non-trivial tree of the forest, thus providing a measure of density around that node. Following the matrix-forest theorem [2],
[3] and a statistical physics framework, it is shown that the SoF density index can be easily computed in closed form through a simple
matrix inversion. Experiments on artificial and real data sets show that the proposed index performs well at finding dense regions, for
graphs of various origins.

1 Introduction

1.1 General introduction

Density is an important concept in graph analysis
and has proven to be of particular interest
in various areas such as, for example, social networks,
biology and the World Wide Web [4]-[6].

The task of identifying dense regions on a graph can
be based on various concepts (degree of a node, cliques,
cores, etc.) leading to various approaches (see Section
1.2). The key concept on which our approach is based is
forest enumeration and, in particular, the matrix-forest
theorem [2], [3], an extension of the well-known matrix-tree
theorem (see, e.g., [7]). More precisely, the method
developed in this paper, inspired by [1], [8]-[10] (based
on paths instead of forests), relies on the enumeration
of all the possible forests in the graph, therefore leading
to the definition of a new density index which will be
called the Sum-over-Forests (SoF) density index. This
measure has a clear and intuitive interpretation: when
enumerating all the possible forests in the graph, a node
will be considered as having a high density index if it is
part of a tree in many - preferably low-cost - forests,
and has a high outdegree within these forests. Indeed,
if a region has a high density, it will contain a large
number of trees - and therefore forests - so that the
nodes belonging to that region will be part of many
forests and have a high outdegree. Those nodes will thus
obtain a high SoF density index.

In order to compute this index, we first define a
Boltzmann probability distribution on the countable set
of forests in the graph by adopting a statistical physics
framework. This distribution has the desired property
that high-cost forests occur with a low probability while
low-cost forests occur with a high probability. As in
statistical physics, it depends on a parameter, $\theta = 1/T$,
controlling the temperature T - and thus the entropy
- of the system. When T is low, only low-cost forests
are taken into account (high-cost forests having a negligible
contribution) while for high values of T, high-cost
forests are as important as the low-cost ones (uniform
distribution).

In a second step, the SoF density index of a node 
is defined according to this probability distribution. 
Roughly speaking, it corresponds to the expectation of 
the outdegree of this node, averaged over all the forests 
(the expectation is taken on all the possible forests). 
Technically speaking, the SoF density index is obtained 
by taking the first-order derivative of the partition function
associated with the system. It is shown that it can be
computed in closed form by inverting an n × n matrix
depending on the immediate costs assigned to the arcs.

1.2 Related work

This section provides a short survey of the related work 
aiming at finding dense regions on graphs. 

A well-known approach for finding high density regions
on graphs relies on identifying dense, highly connected
subgraphs like cliques, plexes, cores, etc. (see, e.g.,
[11]). Cliques are completely connected subgraphs of the
original graph [4]. Unfortunately, finding all the cliques,
or the maximal clique in a graph is NP-complete. As the
notion of clique is very restrictive (if an arc is missing,
then the subgraph is no longer considered a clique),
other ideas relaxing this notion appeared, such as plexes
[12]. A k-plex is a subgraph containing n nodes where
each node is connected to at least n - k other nodes.
Finding k-plexes is, alas, as hard as finding cliques [11].
Cores are similar to plexes, but instead of specifying
how many links are missing to produce a clique, nodes
inside a k-core only need to have a degree of at least k
[13]. All nodes of the core are then connected to at least
k other members of the core. Contrary to cliques and
plexes, cores can be computed in polynomial time, and
there even exists a linear-time algorithm computing the
core structure of a network [14]. A generalization of the
notion of core, called the generalized k-core, is based on
vertex properties other than the degree (in/out degree,
clustering coefficient, ...) and can also be found in [14].
Our SoF density index could be used in conjunction with
a generalized k-core.

Density-based clustering methods use a measure of
density on graphs as an intermediary step for computing
clusters. DBSCAN [15], a widely used clustering algorithm,
computes the local density around a node as the
number of neighbours in a sphere of a certain radius
around that node. Mode-seeking methods, like Mean
Shift [16], compute the modes of a probability density
function to find high density areas. These methods were
originally intended to be used in the feature space of
the data, but adaptations to graph data were recently
proposed [17]-[20].

Another approach for finding dense zones is to compute
a density index (or score) on the nodes of a graph.
One of the most intuitive density indices is the degree
of a node (on undirected graphs, in/out degree on
directed graphs) defined as the number of links a node
has. Indeed, the larger the number of neighbours of a
node, the higher the density around it. This measure
is purely local, taking into account only the direct
neighbours. The strength of a node is an extension of the
degree to weighted graphs, computing the sum of the
weights borne by the arcs linking the node to its neighbours.
When those weights are all equal to one, the strength
reduces to the degree. The clustering coefficient [21] of
a node i is also a notion related to the degree. It counts
the number of links between neighbours of i, divided by
the total number of possible connections between those
neighbours. This measure was extended to weighted
graphs in [22].
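As a rough illustration of these two local indices, the following sketch computes the strength and the (unweighted) clustering coefficient from an adjacency matrix. The function names are hypothetical, and the graph is assumed to be undirected, so the adjacency matrix is symmetric with a zero diagonal; the weighted extension of [22] is not covered.

```python
import numpy as np

def strength(A):
    """Strength of each node: sum of the weights on its incident arcs.

    Assumes an undirected graph, so row sums suffice.
    """
    return np.asarray(A, dtype=float).sum(axis=1)

def clustering_coefficient(A):
    """Local clustering coefficient of each node (unweighted, undirected).

    For node i: number of links between neighbours of i, divided by the
    number of possible links between those neighbours.
    """
    B = (np.asarray(A) > 0).astype(float)   # binarize the weights
    np.fill_diagonal(B, 0.0)                # ignore self-loops
    degree = B.sum(axis=1)
    # (B^3)_ii counts closed walks of length 3; each triangle is counted twice.
    triangles = np.diag(B @ B @ B) / 2.0
    possible = degree * (degree - 1) / 2.0
    cc = np.zeros_like(degree)
    mask = possible > 0                     # nodes with fewer than 2 neighbours get 0
    cc[mask] = triangles[mask] / possible[mask]
    return cc

# Triangle (nodes 0, 1, 2) with a pendant node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(strength(A))
print(clustering_coefficient(A))
```

Both quantities only look at the direct neighbourhood of a node, which is the limitation the SoF index is meant to address.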

Similarly, the Sum-over-Forests (SoF) density index
developed in this paper computes a density score on
nodes by enumerating forests on a graph using the
matrix-forest theorem [2]. This method is based on a
sum-over-forests statistical physics framework.

1.3 Contributions and organization of the paper

This work has three main contributions:

• It defines a new density index on nodes of a directed
graph.

• It shows how this density index can be computed
efficiently through a statistical physics framework
from the immediate costs associated with each arc by
inverting an n × n matrix.

• It shows through experiments on artificial and real
data sets that the SoF index is an accurate tool for
identifying dense regions on graphs.

Section 2 introduces the necessary background and no- 
tation. In Section 3, the probability distribution on the 
set of forests - a Boltzmann distribution - is defined. 
Section 4 introduces our index and shows how it can be 
derived analytically from the partition function. Section 
5 explains how the partition function can be computed 
exactly from the immediate costs while Section 6 derives 
the formulas for computing the density index. Section
7 applies the index to the identification of dense areas
on graphs of various origins. Concluding remarks and
possible extensions are discussed in Section 8.



2 Background and notation 

Consider a weighted directed graph or network without
self-loops, G, not necessarily strongly connected, with a
set of n nodes V (or vertices) and a set of arcs E (or
edges). To each arc linking node k and k', we associate a
positive number $c_{kk'} > 0$ representing the immediate
cost of following this arc. The cost matrix $\mathbf{C}$ is the
matrix containing the immediate costs $c_{kk'}$ as elements.
If, instead of $\mathbf{C}$, we are given an adjacency matrix with
elements $a_{kk'} \geq 0$ indicating the affinity between node k
and node k', the corresponding costs could be computed
from $c_{kk'} = 1/a_{kk'}$. Notice, however, that other relations
- other than the reciprocal relation - between affinity
and cost could be considered as well. The adjacency
matrix containing the elements $a_{kk'}$ is denoted by $\mathbf{A}$,
while the Laplacian matrix of a graph having adjacency
matrix $\mathbf{A}$ is $\mathbf{L}(\mathbf{A}) = \mathbf{D} - \mathbf{A}$, where $\mathbf{D} = \mathrm{Diag}(\mathbf{A}^{T}\mathbf{e})$ is a
diagonal matrix containing the column sums of $\mathbf{A}$. Here,
$\mathbf{e}$ is a column vector full of 1s. Moreover, if the graph is
undirected, it is assumed that, for each arc, there exist
directed links in the two directions $k \to k'$ and $k' \to k$.
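The notation above can be made concrete with a short sketch. The helper names are hypothetical, and representing a missing arc (zero affinity) by an infinite cost is an assumption made here for illustration; the text only specifies the reciprocal relation for existing arcs.

```python
import numpy as np

def costs_from_affinities(A):
    """Cost matrix with c_kk' = 1 / a_kk' on existing arcs.

    Assumption: a zero affinity means 'no arc', encoded as an infinite cost.
    """
    A = np.asarray(A, dtype=float)
    C = np.full(A.shape, np.inf)
    linked = A > 0
    C[linked] = 1.0 / A[linked]
    return C

def laplacian(A):
    """L(A) = D - A with D = Diag(A^T e), i.e. the column sums of A."""
    A = np.asarray(A, dtype=float)
    e = np.ones(A.shape[0])
    return np.diag(A.T @ e) - A

# Small directed graph: 0 -> 1 (affinity 2), 1 -> 2 (affinity 4), 2 -> 0 (affinity 1).
A = np.array([[0.0, 2.0, 0.0],
              [0.0, 0.0, 4.0],
              [1.0, 0.0, 0.0]])
C = costs_from_affinities(A)
L = laplacian(A)
print(C)
print(L)
```

With this definition of D, every column of L(A) sums to zero, which is a quick sanity check on an implementation.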

The objective of the next sections is to define the 
probability distribution on the set of forests as well as the 
density index. Before diving into the details, let us briefly 
describe the main ideas behind the model. In a first step, 
the set of forests in the graph is enumerated through the 
matrix-forest theorem and a probability distribution is 
assigned to each individual forest: the larger the forest, 
the smaller the weight of its contribution, given that 
isolated nodes do not contribute. This probability distribution
depends on a parameter, $\theta = 1/T$, controlling the
smoothing level carried out in the graph: when $\theta$ is large,
only the lowest-cost forests are considered while when $\theta$
is small, higher-cost forests are also taken into account.
In a second step, the expected outdegree each node 
takes in a forest is computed through a sum-over-forests 
statistical physics formalism, providing a measure of 
density on the set of nodes. 

Fig. 1. A directed graph G in which arc costs are uniformly 1.

Fig. 2. Examples of forests on graph G containing two trees: (a) high-cost forest $\varphi_1$; (b) low-cost forest $\varphi_2$. Isolated
nodes are not displayed since they do not contribute to the density index.

3 A Boltzmann distribution on the set of forests

The present section describes how the probability distribution
on the set of forests is assigned. To this end,
let us define the set of rooted forests $\varphi$ that can be
defined in the graph G as $\mathcal{F} = \{\varphi_1, \varphi_2, \ldots\}$. Intuitively,
a rooted forest is an acyclic subgraph of G that has
the same nodes as G and one marked node (a root) in
each component (see [2], [3] for details). In the directed
case, diverging forests are considered, that is, forests
containing diverging rooted trees (i.e., trees that contain
only directed paths from the root to all the other nodes).
Now, as we are dealing with directed graphs, diverging
rooted trees and forests will simply be referred to as trees
and forests. The total cost of such a forest $\varphi$, denoted $C(\varphi)$, is defined as
the sum of the individual costs of the arcs belonging to $\varphi$.
On the other hand, the total weight of such a forest
$\varphi$ is defined as the product of the individual weights (the
elements of the adjacency matrix) of the arcs belonging
to $\varphi$. A forest with no arc (containing only individual
nodes without any connection) has a total cost of 0 and a
total weight of 1.

A Boltzmann probability distribution is defined on
the set $\mathcal{F}$:

$$P(\varphi) = \frac{\exp[-\theta C(\varphi)]}{\sum_{\varphi' \in \mathcal{F}} \exp[-\theta C(\varphi')]} \tag{1}$$



where $\theta$ is the inverse temperature. Thus, as expected,
low-cost forests $\varphi$ (having small $C(\varphi)$) are favored in that
they have a large probability of being chosen. Indeed,
from Equation (1), we clearly observe that when $\theta \to 0^{+}$,
the forest probabilities tend to a uniform distribution.
On the other hand, when $\theta$ is large, the probability
distribution defined by Equation (1) is biased towards
low-cost forests (the most likely forests are the lowest-cost
ones). Notice that in Equation (1) isolated nodes
(with no ingoing or outgoing link) do not contribute to
the probability. In the sequel, it will be assumed that the
user provides the value of the parameter $\theta$.

For illustration, the simple graph G shown in Figure 1
is analysed. Figure 2 shows examples of, respectively,
a high-cost forest $\varphi_1$ and a low-cost forest $\varphi_2$ on G. The
cost associated with $\varphi_1$ is 5, as this forest contains five arcs
with a cost equal to 1. Similarly, the cost of $\varphi_2$ is 2. The
numerator of Equation (1) for $\varphi_1$ becomes $\exp[-5\theta]$,
the numerator for $\varphi_2$ becomes $\exp[-2\theta]$, while the denominator
is the same for both forests. For small values of $\theta$, those
numerators tend to 1 and the probabilities to the uniform
distribution. For high values of $\theta$, the probability of the
lower-cost forest $\varphi_2$ is higher than the probability of the
higher-cost forest $\varphi_1$.
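The effect of the inverse temperature on this example can be checked numerically. The sketch below is only an illustration: it normalizes over the two forests of Figure 2 alone (costs 5 and 2) rather than over the full set of forests of G, so only the ratio of the two probabilities is meaningful for the real distribution. The function name `boltzmann` is hypothetical.

```python
import numpy as np

def boltzmann(costs, theta):
    """Boltzmann probabilities exp(-theta * cost) / sum, over the given costs only."""
    costs = np.asarray(costs, dtype=float)
    # Subtracting the minimum cost leaves the probabilities unchanged
    # but avoids underflow for large theta.
    weights = np.exp(-theta * (costs - costs.min()))
    return weights / weights.sum()

costs = [5.0, 2.0]                # C(phi_1) = 5, C(phi_2) = 2
p_small = boltzmann(costs, 1e-9)  # theta close to 0: nearly uniform
p_large = boltzmann(costs, 2.0)   # larger theta: the low-cost forest dominates
print(p_small, p_large)
```

For any $\theta$, the ratio $P(\varphi_2)/P(\varphi_1)$ equals $\exp[3\theta]$, since the common denominator cancels.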

4 The SoF density index 

By following arguments inspired by [9], it is now
shown that the density index can be computed from a
quantity appearing in the denominator of Equation (1),
defined as

$$Z = \sum_{\varphi \in \mathcal{F}} \exp[-\theta C(\varphi)] \tag{2}$$



and which corresponds to the partition function in
statistical physics (see [23] or any textbook in statistical
physics; for instance [24], [25]). For this purpose, let us
further define the free energy F in the usual way [24],
[25] as

$$F = -\frac{1}{\theta}\log(Z) = -T\log(Z) \tag{3}$$



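where $T = 1/\theta$ is the temperature of the system. The
expected number of times a link $k \to k'$ is present in a
forest can easily be computed through

$$\bar{\eta}(k,k') = \frac{\partial F}{\partial c_{kk'}} = -\frac{1}{\theta}\frac{\partial(\log Z)}{\partial c_{kk'}} \tag{4}$$

$$= \sum_{\varphi \in \mathcal{F}} \frac{\exp[-\theta C(\varphi)]}{\sum_{\varphi' \in \mathcal{F}} \exp[-\theta C(\varphi')]}\,\delta(\varphi; k, k') \tag{5}$$

$$= \sum_{\varphi \in \mathcal{F}} P(\varphi)\,\delta(\varphi; k, k') \tag{6}$$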

where $\delta(\varphi; k, k')$ is a Kronecker delta indicating whether the
link $k \to k'$ is present in forest $\varphi$.
The expected outdegree of node k on a
forest, which defines the SoF density index, is

$$\mathrm{dens}(k) = \sum_{\varphi \in \mathcal{F}} P(\varphi)\left(\sum_{k'=1}^{n} \delta(\varphi; k, k')\right) = \sum_{k'=1}^{n} \bar{\eta}(k,k') \tag{7}$$

and corresponds to the sum of the contributions of the 
arcs issued from node k. 
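The step from Equation (4) to Equation (5) can be spelled out as follows. Since the total cost $C(\varphi)$ is the sum of the costs of the arcs in $\varphi$, its derivative with respect to $c_{kk'}$ is 1 if the arc $k \to k'$ belongs to $\varphi$ and 0 otherwise, that is, $\partial C(\varphi)/\partial c_{kk'} = \delta(\varphi; k, k')$. A sketch of the resulting computation:

```latex
\begin{align*}
\frac{\partial Z}{\partial c_{kk'}}
  &= \sum_{\varphi \in \mathcal{F}} \frac{\partial}{\partial c_{kk'}} \exp[-\theta C(\varphi)]
   = -\theta \sum_{\varphi \in \mathcal{F}} \delta(\varphi; k, k') \exp[-\theta C(\varphi)], \\
-\frac{1}{\theta}\frac{\partial (\log Z)}{\partial c_{kk'}}
  &= -\frac{1}{\theta}\,\frac{1}{Z}\,\frac{\partial Z}{\partial c_{kk'}}
   = \sum_{\varphi \in \mathcal{F}} \frac{\exp[-\theta C(\varphi)]}{Z}\,\delta(\varphi; k, k').
\end{align*}
```

The factor $-\theta$ produced by the derivative cancels the prefactor $-1/\theta$ of the free energy, which is why the free energy, rather than $\log Z$ itself, is differentiated.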

In the next section, we show that the partition function 
can easily be computed from the cost matrix. 

5 Computation of the partition function Z

By using the matrix-forest theorem [2], [3], let us now
show how the partition function Z (Equation (2)) can
be computed exactly from the immediate costs. Indeed,
let us assume a graph G characterized by an adjacency
matrix $\mathbf{A}$ containing the weights on the arcs. From
the matrix-forest theorem (see [2], lemma 2, or [3] for
details), $\det(\mathbf{I} + \mathbf{L}(\mathbf{A}))$ is the sum of the total weights
of all the rooted (diverging in the directed case) forests
$\varphi \in \mathcal{F}$ that can be extracted from the graph. The total
weight of a particular rooted forest $\varphi$ is the product of
the weights of the individual arcs defining it.

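Let us now apply this concept to a new matrix $\mathbf{W}$
defined from the cost matrix, $\mathbf{C}$,

$$\mathbf{W} = \exp[-\theta\mathbf{C}], \tag{8}$$

where the logarithm/exponential functions are taken
elementwise. Thus, the elements of matrix $\mathbf{W}$ are
$w_{kk'} = \exp[-\theta c_{kk'}]$. Now, if we set as adjacency matrix $\mathbf{A} = \mathbf{W}$,
the total weight of a rooted forest $\varphi$ is the product of the
individual weights defining it, i.e.,

$$\prod_{k,k' :\, k \to k' \in \varphi} a_{kk'} = \prod_{k,k' :\, k \to k' \in \varphi} \exp[-\theta c_{kk'}] = \exp\Big[-\theta \sum_{k,k' :\, k \to k' \in \varphi} c_{kk'}\Big] = \exp[-\theta C(\varphi)].$$

We can immediately deduce from the
matrix-forest theorem that $\det(\mathbf{I} + \mathbf{L}(\mathbf{W}))$, where $\mathbf{L}(\mathbf{W}) = \mathrm{Diag}(\mathbf{W}^{T}\mathbf{e}) - \mathbf{W}$,
is equal to $\sum_{\varphi \in \mathcal{F}} \exp[-\theta C(\varphi)] = Z$.

Therefore,

$$Z = \det(\mathbf{I} + \mathbf{L}(\mathbf{W})), \quad \text{with } \mathbf{W} = \exp[-\theta\mathbf{C}] \tag{9}$$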



This result is used in the next section in order to derive
the SoF density index.
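On a very small graph, Equation (9) can be checked against a direct enumeration of the forests. The sketch below is an illustration under stated assumptions, not part of the method: missing arcs are encoded as infinite costs (so that their weight is zero), $\theta > 0$, and a diverging forest is recognized as a set of arcs in which every node has at most one incoming arc and there is no cycle. All function names are hypothetical.

```python
import itertools
import numpy as np

def is_diverging_forest(arcs):
    """True if every node has at most one incoming arc and there is no cycle."""
    parent = {}
    for k, kp in arcs:
        if kp in parent:              # second incoming arc on node kp
            return False
        parent[kp] = k
    for start in parent:
        seen, node = set(), start
        while node in parent:         # walk up towards the root
            if node in seen:          # came back to a visited node: cycle
                return False
            seen.add(node)
            node = parent[node]
    return True

def partition_function_brute_force(C, theta):
    """Sum of exp(-theta * C(forest)) over all diverging forests (tiny graphs only)."""
    n = C.shape[0]
    arcs = [(k, kp) for k in range(n) for kp in range(n)
            if k != kp and np.isfinite(C[k, kp])]
    Z = 0.0
    for size in range(n):             # a forest on n nodes has at most n - 1 arcs
        for subset in itertools.combinations(arcs, size):
            if is_diverging_forest(subset):
                Z += np.exp(-theta * sum(C[k, kp] for k, kp in subset))
    return Z

def partition_function_det(C, theta):
    """Z = det(I + L(W)) with W = exp(-theta * C), as in Equation (9)."""
    W = np.exp(-theta * C)            # infinite costs give zero weights
    np.fill_diagonal(W, 0.0)          # no self-loops
    L = np.diag(W.sum(axis=0)) - W    # Diag(W^T e) - W
    return np.linalg.det(np.eye(C.shape[0]) + L)

inf = np.inf
C = np.array([[inf, 1.0, 2.0],
              [1.0, inf, inf],
              [inf, 3.0, inf]])
print(partition_function_brute_force(C, 0.7), partition_function_det(C, 0.7))
```

The enumeration grows exponentially with the number of arcs, so it is only useful as a test; the determinant is what makes the approach tractable.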

6 Computation of the SoF density index 

Now that we have seen how to compute the partition
function Z, we turn to the computation of the density
index that can be deduced from Z thanks to Equations
(4) and (7).

We thus have to compute the derivatives of Z (Equation (9))
in terms of $c_{kk'}$ (see Equation (4)) in order
to obtain the different quantities of interest. Now, it is
well-known (see, e.g., [26], [27]) that $\partial\log(\det(\mathbf{X}))/\partial t =
\mathrm{trace}(\mathbf{X}^{-1}\,\partial\mathbf{X}/\partial t)$. Thus, for the expected number of times
the link $k \to k'$ appears in a forest, we obtain



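$$\bar{\eta}(k,k') = \frac{\partial F}{\partial c_{kk'}}
= -\frac{1}{\theta}\frac{\partial \log(\det(\mathbf{I} + \mathbf{L}(\mathbf{W})))}{\partial c_{kk'}}
= -\frac{1}{\theta}\,\mathrm{trace}\left(\mathbf{Z}\,\frac{\partial \mathbf{L}(\mathbf{W})}{\partial c_{kk'}}\right)
= -\frac{1}{\theta}\,\mathrm{trace}\left(\mathbf{Z}\,\frac{\partial(\mathbf{D} - \mathbf{W})}{\partial c_{kk'}}\right) \tag{10}$$

where the matrix $\mathbf{Z}$ is defined as

$$\mathbf{Z} = (\mathbf{I} + \mathbf{L}(\mathbf{W}))^{-1} = (\mathbf{I} + (\mathrm{Diag}(\mathbf{W}^{T}\mathbf{e}) - \mathbf{W}))^{-1} \tag{11}$$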

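Now, we easily find that $\partial\mathbf{W}/\partial c_{kk'} = -\theta w_{kk'}\,\mathbf{e}_{k}\mathbf{e}_{k'}^{T}$, and
$\partial\mathbf{D}/\partial c_{kk'} = -\theta w_{kk'}\,\mathbf{e}_{k'}\mathbf{e}_{k'}^{T}$, so that

$$\frac{\partial \mathbf{L}(\mathbf{W})}{\partial c_{kk'}} = \frac{\partial(\mathbf{D} - \mathbf{W})}{\partial c_{kk'}} = -\theta w_{kk'}\,(\mathbf{e}_{k'}\mathbf{e}_{k'}^{T} - \mathbf{e}_{k}\mathbf{e}_{k'}^{T}), \tag{12}$$

where $\mathbf{e}_{k}$ is a basis column vector with zeroes everywhere
except in position k where there is a 1.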



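Thus, by defining $\mathbf{z}_{k} = \mathrm{col}_{k}(\mathbf{Z})$ as column k of matrix
$\mathbf{Z}$,

$$\begin{aligned}
\bar{\eta}(k,k') &= \mathrm{trace}\left(w_{kk'}\,\mathbf{Z}\,(\mathbf{e}_{k'}\mathbf{e}_{k'}^{T} - \mathbf{e}_{k}\mathbf{e}_{k'}^{T})\right) \\
&= w_{kk'}\left(\mathrm{trace}(\mathbf{z}_{k'}\mathbf{e}_{k'}^{T}) - \mathrm{trace}(\mathbf{z}_{k}\mathbf{e}_{k'}^{T})\right) \\
&= w_{kk'}\,z_{k'k'} - w_{kk'}\,z_{k'k}
\end{aligned} \tag{13}$$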

Therefore, the expected outdegree of node k - the SoF
density index of node k - is

$$\mathrm{dens}(k) = \sum_{k'=1}^{n} \bar{\eta}(k,k') = \sum_{k'=1}^{n} \left(w_{kk'}\,z_{k'k'} - w_{kk'}\,z_{k'k}\right) \tag{14}$$

where we used Equations (7) and (13). The n × 1 column
vector containing the elements dens(k) will be called $\mathbf{d}$,
with

$$\mathbf{d} = \mathbf{W}\,\mathrm{diag}(\mathbf{Z}) - \mathrm{diag}(\mathbf{W}\mathbf{Z}) \tag{15}$$

and where $\mathrm{diag}(\mathbf{X})$ is a column vector containing the
diagonal of matrix $\mathbf{X}$. The SoF index can therefore be
found by applying the following simple procedure:

1) Compute the $\mathbf{W}$ matrix through Equation (8).

2) Find the matrix $\mathbf{Z}$ from Equation (11).

3) Compute the column vector $\mathbf{d}$ containing the SoF
index of each node with Equation (15).
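The three-step procedure can be sketched with NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `sof_density` is hypothetical, missing arcs are assumed to be encoded as infinite costs, $\theta$ is assumed strictly positive, and a dense inverse is used, which is only practical for moderately sized graphs.

```python
import numpy as np

def sof_density(C, theta):
    """SoF density index of every node, from a cost matrix C and theta > 0.

    Assumption: C[k, k'] = np.inf where there is no arc k -> k'.
    """
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    # Step 1, Equation (8): elementwise exponential; infinite costs give weight 0.
    W = np.exp(-theta * C)
    np.fill_diagonal(W, 0.0)                # the graph has no self-loops
    # Step 2, Equation (11): Z = (I + Diag(W^T e) - W)^{-1}.
    L = np.diag(W.sum(axis=0)) - W
    Z = np.linalg.inv(np.eye(n) + L)
    # Step 3, Equation (15): d = W diag(Z) - diag(W Z).
    return W @ np.diag(Z) - np.diag(W @ Z)

# Two nodes linked in both directions with unit costs.
inf = np.inf
C_two = np.array([[inf, 1.0],
                  [1.0, inf]])
d_two = sof_density(C_two, theta=1.0)
print(d_two)
```

On this two-node example there are only three forests (no arc, the arc 0 → 1, the arc 1 → 0), so the expected outdegree of each node is $w/(1 + 2w)$ with $w = \exp[-\theta]$, which gives a convenient hand check of the formula.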

7 Experiments 

In this experimental section, the SoF density index is 
assessed on the identification of dense regions on graphs. 
Unlike classical clustering methods, the goal here is not 
to find an exact partition of the data, but only regions 
of graphs where the nodes are tightly aggregated, sug- 
gesting some community-like structure. 

7.1 Datasets 

The performance of the SoF density index is assessed on
ten datasets belonging to four groups: 3-communities,
10-communities, S-Sets and NewsGroup datasets.

The 3-communities (resp. 10-communities) datasets
are artificial datasets we built: each one is made of three
(resp. ten) clusters, created using Gaussian distributions
$N(\mu, \sigma)$, $\mu$ being the mean (the center of the cluster)
and $\sigma^{2}$ the variance of the data. Each cluster is made
of 500 nodes, lying in two dimensions. Three values of
$\sigma$ (illustrating various degrees of overlapping between
the communities) were used to build graphs in the
3-communities case: 0.05, 0.1, 0.5 (the standard deviation
is the same in each direction, giving isotropic communities).
For the 10-communities datasets, the $\sigma$ values are
different in the two space directions, (x, y). These values,
called $\sigma_x$ and $\sigma_y$, are reported in Table 1 for two sets: S1
with small overlapping and S2 with strong overlapping.

The S-Sets [28] include two datasets: S2 and S4. They
are also based on artificial data and are composed of 5000
two-dimensional observations each, grouped in 15 clusters
of various shapes. Figure 7 illustrates S2, with well
separated clusters, and S4, showing more overlapped
ones.

Finally, graphs generated from the Newsgroup dataset
are used. This dataset is originally composed of about
20,000 unstructured documents, taken from 20 discussion
groups (newsgroups) of the Usenet diffusion list,
and composed of 20 classes. For our experiments, three
subsets related to different topics are extracted from
the original database (NewsGroup1, 2, and 3) [29]. The
graphs of documents were built by sampling at random
about 200 documents in each of three classes from three
different topics.

Table 1. ($\sigma_x$, $\sigma_y$) (standard deviations) values for the 10-communities datasets, for
two degrees of overlapping between the clusters (S1 small overlapping, S2
strong overlapping).

7.2 Graph construction

We constructed the graphs corresponding to the 3/10-communities
and the S-Sets datasets using two classical
methods: the ε-graph and the k-nearest neighbours (k-NN).

The ε-graph computes the Euclidean distance between
each pair of observations in the dataset and transforms
it into an affinity using

$$a_{ij} = \exp\left[-\frac{d_{ij}^{2}}{\sigma^{2}}\right] \tag{16}$$

where $d_{ij}$ is the Euclidean distance between nodes i and
j, and $\sigma^{2}$ is the variance of the distances between all the
observations in the dataset. The nodes are then linked
to others only if they show an affinity above a
certain threshold (the 80th, 90th, 95th, and 99th percentiles were
used). The resulting graphs are undirected, and both the
weighted case (where arcs bear the node affinities) and
the unweighted case are investigated.

The k-NN graph construction method simply links a
node to its k nearest neighbours, i.e., those which have
the highest affinity with that node. This relation is not
symmetric, giving rise to directed graphs. We transform
them into undirected graphs using

$$\mathbf{A} = \max(\mathbf{A}, \mathbf{A}^{T}) \tag{17}$$

where $\mathbf{A}$ is the adjacency matrix of the created graph,
and the maximum operator is taken elementwise.

For the NewsGroup datasets, the graphs were already
built [29] and only the adjacency matrices are at our
disposal. To visualize those graphs, we use the diffusion
maps embedding method [30]-[32] in two dimensions
(see Figure 11), whose output is the new spatial coordinates
of the nodes. The corresponding graphs are reconstructed
with the ε-graph method, allowing us to compute
the density index on the nodes. Indeed, trying to proceed
inversely (computing the densities before the diffusion
map embedding) is not visually accurate: during the embedding,
the nodes are spatially rearranged and the color
of the nodes (indicating high or low density, see below)
does not reflect the true density of the 2-D embedding.

The cost matrices used in the evaluation of the SoF
density index are then computed as the reciprocals of
the affinity matrices constructed above.


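The k-NN construction and the symmetrization of Equation (17) can be sketched as follows. The function names are hypothetical, and keeping the affinity as the weight of each retained arc is an assumption made for this illustration; ties between equal affinities are broken arbitrarily by the sort.

```python
import numpy as np

def knn_adjacency(affinity, k):
    """Directed k-NN graph: each node keeps arcs to its k highest-affinity nodes."""
    S = np.asarray(affinity, dtype=float)
    n = S.shape[0]
    ranked = S.copy()
    np.fill_diagonal(ranked, -np.inf)       # a node is never its own neighbour
    nearest = np.argsort(-ranked, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = nearest.ravel()
    A = np.zeros((n, n))
    A[rows, cols] = S[rows, cols]           # arcs bear the affinities (assumption)
    return A

def symmetrize(A):
    """Elementwise maximum of A and its transpose, as in Equation (17)."""
    return np.maximum(A, A.T)

affinity = np.array([[0.0, 0.9, 0.1],
                     [0.9, 0.0, 0.5],
                     [0.1, 0.5, 0.0]])
A_directed = knn_adjacency(affinity, k=1)
A_undirected = symmetrize(A_directed)
print(A_directed)
print(A_undirected)
```

In this example node 2 points to node 1 but not the other way round, so the directed matrix is asymmetric; the elementwise maximum adds the missing reverse arc.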

7.3 Evaluation methods 

We use two methods to evaluate to which extent the 
high density areas are well identified: Spearman's cor- 
relation (only applicable to 3/10-communities datasets) 
and visual checking (applicable to all datasets). 

Firstly, since the probability density function is known 
for every node of the 3/10-communities datasets (i.e., the 
exact parameters' values of the gaussian distributions are 
known), we compute Spearman's correlation between 
those true densities and the SoF densities. 
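Spearman's correlation is the Pearson correlation computed on ranks, so it only measures whether the index orders the nodes in the same way as the true density. A minimal sketch, assuming no tied values (ties would require average ranks, which this illustration does not handle); the function names are hypothetical:

```python
import numpy as np

def ranks(x):
    """Rank of each value, from 1 (smallest) to n (largest); assumes no ties."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(1, len(x) + 1)
    return r

def spearman(x, y):
    """Spearman's rank correlation between two score vectors."""
    return np.corrcoef(ranks(x), ranks(y))[0, 1]

true_density = [1.0, 2.0, 3.0, 4.0]       # hypothetical true densities
estimated = [10.0, 20.0, 30.0, 25.0]      # hypothetical density index values
rho = spearman(true_density, estimated)
print(rho)
```

Because only the ordering matters, any monotone rescaling of the density index leaves the score unchanged, which suits an index that is not itself a probability density.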

Secondly, we perform a visual checking on the graphs 
by superimposing the density index on the representa- 
tion of the nodes. This is done by assigning each node a 
color: from dark blue for nodes presenting a low density 
value to dark red for nodes presenting a high density 
value. 

Concerning the tuning of the $\theta$ parameter in the SoF
method, we used the correlation method on the 3/10-communities
graphs. The parameter value giving the
highest correlation score (for threshold graphs, $\theta = 5$,
and for k-NN graphs, $\theta = 50$) is then used for the 3/10-communities
as well as for the S-Sets and Newsgroup
datasets.

The results obtained with the SoF density index are 
finally compared with two other measures for identify- 
ing dense zones: the strength (Str) and the clustering 
coefficient (CC). 

7.4 Results 

Correlation results

The correlation results for 3/10-communities are displayed
in Figure 3.

When using the k-NN for constructing graphs, the 
SoF density index is clearly superior to the strength and 
to the clustering coefficient (the latter performs badly 
in every situation, and is not further considered in the 
sequel of this section). This may be explained by the 
fact that the information concerning the connectivity is 
useless in this case, as all the nodes have theoretically 
almost the same degree. The SoF density index then 
makes a better use of the affinities borne by the arcs 
of the graphs than the strength does, which explains its 
superior results. 

When using the ε-graph construction method, the
results are not so clear. The results obtained by the
strength and the SoF index for the 3-communities case
have a correlation with the true density of almost one
and are practically identical (only the weighted case is
represented here in Figure 3(b) as the unweighted case
gives similar results). The SoF index is clearly better
on the 10-communities datasets in the weighted case
and for low threshold (ε) values. Indeed, the number of
neighbours increases dramatically when the threshold is
low, and again the information of connectivity becomes
useless, each node having a large degree. The
only useful information is the affinities and, as before,
the SoF density index uses it more efficiently
than the strength. When the threshold is higher, the two
measures converge to the same value. In the unweighted
case (no affinity information), the SoF index and the
strength behave similarly: the correlations are low for
low threshold values and increase when the threshold
increases. The good results of both the SoF index and
the strength in the 3-communities dataset with threshold
and unweighted arcs are probably due to the fact that
these datasets are smaller and simpler to handle,
having only 3 clusters instead of 10, distributed with
Gaussians of the same variance in all directions.

A first conclusion can be drawn so far: the SoF density
index is much more stable and independent of the
type of graph than the strength (and the clustering
coefficient).



Visual results 

The visual results confirm the correlation results described
above. As there are many different cases for the
3/10-communities datasets, only a few visual examples,
representative of the overall behavior of the density
measures, are shown. For instance, Figure 5 shows the
3-communities datasets (threshold 95, weighted) with the
SoF density index superimposed. It can be observed on
this simple example that the SoF density index is able
to recover the dense areas of the clusters. In the
10-communities datasets, a visual confirmation is given in
Figure 6. The SoF density index is visually very close to
the true density and the highly dense regions are well
identified.

The dense regions on the S-Sets are also well identified.
Figure 8 shows that the SoF density index is
able to recover the 15 densest areas on the S2 and S4
graphs, even if they are tightly aggregated. Str and
CC are illustrated in Figures 9 and 10, showing poor
results, mainly on S4. The Newsgroup datasets confirm
the results obtained so far (Figure 12). For the SoF
density index and Str, the three clusters are recovered
on those graphs, except on NewsGroup3 where two
clusters are too tightly intertwined to be differentiated.
The CC does not correctly identify the dense areas, as
in the 3-communities case. Figures concerning S-Sets
and Newsgroup graphs show only results obtained for
weighted graphs, as those results are essentially identical
for unweighted graphs.

(k-NN) or Threshold (Th) methods.

Fig. 4. 3-communities datasets for various $\sigma$ values with true density
superimposed.

Fig. 5. 3-communities datasets for various $\sigma$ values with the SoF density
index superimposed.

Fig. 6. 10-communities dataset (low sigma values S1) with true density
(left figure) and SoF density index (right figure) superimposed.

Fig. 8. S-Sets datasets with SoF density index superimposed (Threshold
95, Weighted).

Fig. 9. S-Sets datasets with strength superimposed (Threshold 95,
Weighted).

Fig. 10. S-Sets datasets with CC superimposed (Threshold 95, Weighted).

Fig. 12. NewsGroup datasets with SoF density index superimposed
(Threshold 95, Weighted).

8 Conclusion and Perspectives

This work introduced a new density index on the nodes
of a graph. The main idea behind the model is that
a node has a high density index if it is present in a
large number of (preferably low-cost) forests, together
with a high outdegree. This model depends on a meta-parameter
$\theta$, gradually biasing the forest probabilities
from uniform towards low-cost forests. A sum-over-paths
statistical physics framework is used in order to
derive the form of the index in terms of the immediate
costs defined on the arcs. It can be computed efficiently
by inverting an n × n matrix, where n is the number of
nodes, leading to an overall time complexity of $O(n^{3})$.

The application of the SoF density index to the task of 
searching dense areas on graphs shows that it performs 
well, being able to recover all the high density regions 
- corresponding to the center of clusters - on different 
graphs. Moreover, the correlation results between the 
SoF density index and the true density (when available) 
are often close to one. The SoF density index also gives 
more stable results than the strength regarding the way 
a graph is constructed. 

In the future, this index could be used together with
a density-based clustering method, for instance a mode
seeking algorithm on graphs (like in [20]), for clustering
tasks. We will also investigate the application of the proposed
technique on large graphs, as in [33]. Indeed, the
Sum-over-Forests measure only depends on the diagonal
of the inverse matrix $\mathbf{Z}$ (this can be easily deduced from
Equation (15). Moreover, the matrix $(\mathbf{I} + \mathbf{L}(\mathbf{W}))^{-1}$ is
diagonally dominant). In this case, scalable methods can
be used for computing the diagonal of $\mathbf{Z}$ (see, e.g., [34]-[36]).
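The dependence on the diagonal of $\mathbf{Z}$ alone can be made explicit with a short derivation, given here as a sketch that uses only the definitions of Equations (11) and (15), with $\mathbf{D} = \mathrm{Diag}(\mathbf{W}^{T}\mathbf{e})$ and $\mathbf{e}$ the vector of ones:

```latex
\begin{align*}
(\mathbf{I} + \mathbf{D} - \mathbf{W})\,\mathbf{Z} = \mathbf{I}
  \quad &\Longrightarrow \quad
  \mathbf{W}\mathbf{Z} = \mathbf{Z} + \mathbf{D}\mathbf{Z} - \mathbf{I}, \\
\mathrm{diag}(\mathbf{W}\mathbf{Z})
  &= (\mathbf{I} + \mathbf{D})\,\mathrm{diag}(\mathbf{Z}) - \mathbf{e}
  \qquad \text{(since $\mathbf{D}$ is diagonal)}, \\
\mathbf{d} = \mathbf{W}\,\mathrm{diag}(\mathbf{Z}) - \mathrm{diag}(\mathbf{W}\mathbf{Z})
  &= \mathbf{e} - (\mathbf{I} + \mathbf{D} - \mathbf{W})\,\mathrm{diag}(\mathbf{Z})
   = \mathbf{e} - (\mathbf{I} + \mathbf{L}(\mathbf{W}))\,\mathrm{diag}(\mathbf{Z}).
\end{align*}
```

Once the diagonal of $\mathbf{Z}$ is available, the index therefore follows from a single sparse matrix-vector product, without forming the off-diagonal entries of the inverse.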